A semantic partition based text mining model for document classification

Author

  • Catherine Inibhunu
Abstract

Feature extraction is a mechanism for extracting key phrases from text documents. The extraction can be weight-based, rank-based, or semantic. Weighted and ranking-based feature extraction assigns scores to extracted words using various heuristics, and the highest-scoring words are treated as important. Semantic extraction tries to understand word meanings, and words with a stronger orientation in the document's context are picked as key features. Weighted and ranking-based approaches are used to create document summaries that can stand in for the original documents when those are unavailable. However, these two approaches suffer from major drawbacks: (1) the generated summaries may contain words that seem irrelevant to the document context, (2) sentences containing key words may be eliminated if they rank below a given threshold, and (3) summaries must be processed further before they can serve as input to mining algorithms such as Apriori. This thesis proposes the Semantic Partitions (SEM-P) and Enhanced Semantic Partitions (ESEM-P) algorithms, which are based on the semantic orientation of words in a document. The partitioning reduces the number of words needed to represent each document as input for discovering frequent word patterns in a collection of documents, while still preserving the semantics of the documents. ESEM-P applies a weighting and ranking heuristic to each term in a partition to prune low-ranked terms, which improves its performance over SEM-P. The identified frequent word patterns are used to generate a document classification model.
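As a rough illustration of the kind of pipeline the abstract outlines (reduce each document to a small set of terms, then mine frequent word patterns across the collection), here is a minimal Python sketch. It is not the SEM-P or ESEM-P algorithm: raw term frequency stands in for the thesis's weighting and ranking heuristic, no semantic partitioning is attempted, and the function names `prune_terms` and `frequent_patterns`, the sample documents, and the thresholds are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def prune_terms(doc_tokens, keep):
    """Keep the `keep` highest-frequency terms of one document.

    Assumption: raw term frequency is only a placeholder for a real
    weighting and ranking heuristic.
    """
    counts = Counter(doc_tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return set(ranked[:keep])


def frequent_patterns(transactions, min_support):
    """Apriori-style level-wise search.

    Returns a dict mapping each term set (frozenset) that occurs in at
    least `min_support` documents to its document count.
    """
    def support(itemset):
        return sum(1 for terms in transactions if itemset <= terms)

    items = {term for terms in transactions for term in terms}
    current = {frozenset([item]) for item in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        k += 1
        # Join step: combine frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Hypothetical toy collection of three tokenized documents.
docs = [
    "text mining text mining model".split(),
    "text mining document text mining".split(),
    "document model document model text".split(),
]

# Each document is reduced to its two top-ranked terms before mining.
transactions = [prune_terms(doc, keep=2) for doc in docs]
patterns = frequent_patterns(transactions, min_support=2)
```

Pruning before mining keeps the candidate space small, which is the general motivation the abstract gives for reducing the number of words per document. The resulting `patterns` dictionary could then serve as features for a classifier, although how the thesis builds its classification model is not described here.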

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A large volume of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...


Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in preprocessing by removing noise, binarizing, and detecting and correcting image skew. In document segmentation, an algorith...


Enhanced Semantic Preserved Concept Based Mining Model for Enhancing Document Clustering

The project “Enhanced semantic preserved concept based mining model for enhancing document clustering” proposes an enhancement of the data mining model for efficient information retrieval. Concept-based mining is a challenging and very active field at present and has great importance in text categorization applications. A lot of research work has been done in this field but there...


Semantic Text Mining and its Application in Biomedical Domain

Illhoi Yoo, Xiaohua Hu, Ph.D. A huge amount of biomedical knowledge and novel discoveries have been produced and collected in text databases or digital libraries, such as MEDLINE, because the most natural form to store information is text. In order to cope with this pressing text information overload, text mining is employed. However, ...


Publication date: 2018